Isaac_Muturi_assessment
**Project 1: Fine-Tuning a Hugging Face Transformer Model for Code Generation** In this project, I fine-tuned a Hugging Face Transformer model for code generation. Using the evol-codealpaca-v1 dataset (111,272 instruction/output pairs), I applied Parameter-Efficient Fine-Tuning (PEFT) with LoRA adapters and 4-bit quantization to train Salesforce's CodeGen 350M model. The model is first loaded in quantized form to reduce memory usage and is then fine-tuned on instruction prompts paired with code answers. The result is a model that generates code snippets from natural language prompts, with possible applications in code automation and software development.
**Project 2: AI-Powered Data Extraction and Content Retrieval with GenAI Stack** In my second project, I used GenAI Stack, a framework for AI-based content retrieval and data extraction. By integrating Langchain ETL for data extraction, Hugging Face Embeddings for text representation, and Langchain Retriever for content retrieval, I built a system that answers specific queries about information extracted from online sources. Its ability to answer natural language questions suggests uses ranging from chatbots to knowledge management systems.
Tags:
#deep-learning
NAME: Isaac Ndirangu Muturi
EMAIL: ndirangumuturi749@gmail.com
Installation of Required Libraries
!pip install -q -U torch
!pip install -q -U transformers
!pip install -q -U datasets
!pip install -q -U trl
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U bitsandbytes
!pip install -q -U accelerate
!pip install -q -U huggingface_hub
!pip install -q -U wandb
!pip install -q -U einops scipy
Installing build dependencies ... done Getting requirements to build wheel ... done Preparing metadata (pyproject.toml) ... done ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. tokenizers 0.14.1 requires huggingface_hub<0.18,>=0.16.4, but you have huggingface-hub 0.18.0 which is incompatible. ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 kB 1.6 MB/s eta 0:00:00
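The pip message above reports that tokenizers 0.14.1 expects huggingface_hub>=0.16.4,<0.18 while 0.18.0 was installed. As a minimal sketch of what that range check means, the snippet below compares dotted version strings as integer tuples. The helper names are hypothetical, 0.17.3 is only an example of an in-range version, and this simple parser is an assumption that ignores pre-release tags (the packaging library handles those properly).

```python
def parse_version(text):
    """Turn a dotted version string such as '0.18.0' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, lower_inclusive, upper_exclusive):
    """Return True when lower_inclusive <= version < upper_exclusive."""
    parsed = parse_version(version)
    return parse_version(lower_inclusive) <= parsed < parse_version(upper_exclusive)


# Constraint reported by pip: huggingface_hub>=0.16.4,<0.18
installed_ok = satisfies("0.18.0", "0.16.4", "0.18")  # the version pip installed
older_ok = satisfies("0.17.3", "0.16.4", "0.18")      # an example in-range version

print("0.18.0 satisfies the constraint:", installed_ok)
print("0.17.3 satisfies the constraint:", older_ok)
```

If the conflict causes problems, pinning huggingface_hub inside the reported range is one option; the notebook itself proceeds without doing so.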
Git Credential Helper Configuration
Please run this cell to configure Git credential helper for secure access to repositories.
!git config --global credential.helper store
Hugging Face Hub Login
Run the following cell to log in to your Hugging Face Hub account.
from huggingface_hub import notebook_login
notebook_login()
Load the Training Dataset
In this cell, we load the training dataset "evol-codealpaca-v1" from the Hugging Face Hub using the datasets library, specifying the split as "train". The dataset is stored in the variable dataset. We then print the contents of the dataset.
from datasets import load_dataset
# Load your dataset
dataset = load_dataset("theblackcat102/evol-codealpaca-v1", split="train")
print(dataset)
Dataset({ features: ['instruction', 'output'], num_rows: 111272 })
Fine Tuning
This code fine-tunes a Salesforce CodeGen model (codegen-350M-mono) loaded with 4-bit NF4 quantization to reduce memory usage. Training uses the SFTTrainer from TRL (Transformer Reinforcement Learning) together with a LoRA configuration from PEFT (Parameter-Efficient Fine-Tuning). The batch size and gradient accumulation settings are chosen to keep memory usage low for this model and dataset.
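To make the memory argument for LoRA concrete, here is a minimal sketch of how many trainable parameters one adapter adds to a single linear layer. The values r=16 and lora_alpha=16 come from the LoraConfig in the training cell; the 1024-wide square layer is an assumed size for illustration only, and the helper names are hypothetical.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


def full_params(d_in, d_out):
    """Parameters in the frozen weight matrix that the adapter sits beside."""
    return d_in * d_out


hidden_size = 1024  # assumed layer width, for illustration only
rank = 16           # r in the LoraConfig
lora_alpha = 16     # lora_alpha in the LoraConfig

added = lora_trainable_params(hidden_size, hidden_size, rank)
frozen = full_params(hidden_size, hidden_size)
scaling = lora_alpha / rank  # factor applied to the low-rank update in standard LoRA

print("trainable adapter parameters:", added)
print("frozen layer parameters:", frozen)
print("trainable fraction:", added / frozen)
print("update scaling:", scaling)
```

Under these assumptions the adapter trains roughly 3% as many parameters as the layer it modifies, which is why a small rank keeps optimizer state and gradients cheap.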
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM
from peft import LoraConfig
from transformers import BitsAndBytesConfig
# Load the desired model with quantization
model_name = "Salesforce/codegen-350M-mono"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False
# Load the desired tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Define the formatting function
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        output_texts.append(text)
    return output_texts

response_template = " ### Answer:"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)
# Define your PEFT configuration
peft_config = LoraConfig(
    r=16,  # Reducing this value to 16 for memory optimization
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)
# Define TrainingArguments for optimal training
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    per_device_train_batch_size=2,  # Reducing batch size to 2 for memory optimization
    logging_steps=500,
    save_steps=1000,
    num_train_epochs=1,
    optim="paged_adamw_32bit",
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    fp16=True,
    max_grad_norm=0.3,
    max_steps=-1,
    gradient_accumulation_steps=1,  # Reducing gradient accumulation steps to 1 for memory optimization
)
# Create the SFTTrainer with training arguments
trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    formatting_func=formatting_prompts_func,
    peft_config=peft_config,
    args=training_args,
    max_seq_length=512
)
# Pre-process the model by upcasting the layer norms to float32 for more stable training
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)
# Train the model
trainer.train()
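The training cell builds a DataCollatorForCompletionOnlyLM but does not pass it to SFTTrainer, so the loss in this run is presumably computed over the full "### Question ... ### Answer ..." text. For reference, the sketch below illustrates the idea behind completion-only training with plain Python lists: every label up to and including the response template is replaced by -100, the value PyTorch's cross-entropy loss ignores by default. The token ids are made up and the helper names are hypothetical; this is an illustration of the masking idea, not TRL's implementation.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def find_subsequence(sequence, pattern):
    """Return the index where pattern first occurs in sequence, or -1 if absent."""
    for start in range(len(sequence) - len(pattern) + 1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_prompt_labels(token_ids, template_ids):
    """Keep labels only for tokens that come after the response template."""
    start = find_subsequence(token_ids, template_ids)
    if start == -1:
        # Template not found: nothing in this example contributes to the loss
        return [IGNORE_INDEX] * len(token_ids)
    answer_start = start + len(template_ids)
    return [IGNORE_INDEX] * answer_start + token_ids[answer_start:]


# Made-up ids: three question tokens, a two-token template, three answer tokens
tokens = [11, 12, 13, 90, 91, 21, 22, 23]
template = [90, 91]
labels = mask_prompt_labels(tokens, template)
print(labels)
```

With masking of this kind the model is only penalized for the answer tokens, which may help when the goal is to generate answers rather than to reproduce questions.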
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You're using a CodeGenTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Step | Training Loss |
---|---|
100 | 2.196400 |
200 | 2.065200 |
300 | 1.993900 |
400 | 1.966300 |
500 | 2.026900 |
600 | 1.929500 |
700 | 1.908900 |
800 | 2.032900 |
900 | 2.008700 |
1000 | 1.928300 |
1100 | 1.942300 |
1200 | 1.951000 |
1300 | 1.971100 |
1400 | 1.853000 |
1500 | 1.960200 |
1600 | 1.920000 |
1700 | 1.954600 |
1800 | 1.875700 |
1900 | 1.913100 |
2000 | 1.778800 |
2100 | 1.910200 |
2200 | 2.132900 |
2300 | 2.047700 |
2400 | 1.952600 |
2500 | 1.856200 |
2600 | 1.980300 |
2700 | 1.765700 |
2800 | 1.825100 |
2900 | 1.886700 |
3000 | 1.957300 |
3100 | 1.916800 |
3200 | 1.988300 |
3300 | 1.907900 |
3400 | 1.966500 |
3500 | 1.869500 |
3600 | 1.808900 |
3700 | 1.952700 |
3800 | 1.908900 |
3900 | 1.773700 |
4000 | 1.819200 |
4100 | 1.793800 |
4200 | 1.872000 |
4300 | 1.855200 |
4400 | 1.822200 |
4500 | 1.747200 |
4600 | 1.877200 |
4700 | 1.864900 |
4800 | 1.846400 |
4900 | 1.767000 |
5000 | 1.777900 |
5100 | 1.871100 |
5200 | 1.912200 |
5300 | 1.872900 |
5400 | 1.873700 |
5500 | 1.820900 |
5600 | 1.822000 |
5700 | 1.851900 |
5800 | 1.791500 |
5900 | 1.888400 |
6000 | 1.805100 |
6100 | 1.816500 |
6200 | 1.796300 |
6300 | 1.750000 |
6400 | 1.857900 |
6500 | 1.904700 |
6600 | 1.824300 |
6700 | 1.849500 |
6800 | 1.896500 |
6900 | 1.785800 |
7000 | 1.864500 |
7100 | 1.814400 |
7200 | 1.736900 |
7300 | 1.804300 |
7400 | 1.783700 |
7500 | 1.830000 |
7600 | 1.834900 |
7700 | 1.794000 |
7800 | 1.750100 |
7900 | 1.786600 |
8000 | 1.760400 |
8100 | 1.845900 |
8200 | 1.840200 |
8300 | 1.812400 |
8400 | 1.794300 |
8500 | 1.777200 |
8600 | 1.787000 |
8700 | 1.759300 |
8800 | 1.757100 |
8900 | 1.866700 |
9000 | 1.816900 |
9100 | 1.897200 |
9200 | 1.821600 |
9300 | 1.744600 |
9400 | 1.819400 |
9500 | 1.786400 |
9600 | 1.870900 |
9700 | 1.902500 |
9800 | 1.836200 |
9900 | 1.805000 |
10000 | 1.900200 |
10100 | 1.797700 |
10200 | 1.766500 |
10300 | 1.797200 |
10400 | 1.863000 |
10500 | 1.776900 |
10600 | 1.806600 |
10700 | 1.775800 |
10800 | 1.831300 |
10900 | 1.696500 |
11000 | 1.809200 |
11100 | 1.806900 |
11200 | 1.727500 |
11300 | 1.778300 |
11400 | 1.747700 |
11500 | 1.706000 |
11600 | 1.726900 |
11700 | 1.924800 |
11800 | 1.837300 |
11900 | 1.659400 |
12000 | 1.819500 |
12100 | 1.719300 |
12200 | 1.787300 |
12300 | 1.821000 |
12400 | 1.768000 |
12500 | 1.818600 |
12600 | 1.791400 |
12700 | 1.838800 |
12800 | 1.804900 |
12900 | 1.743200 |
13000 | 1.767600 |
13100 | 1.754200 |
13200 | 1.664800 |
13300 | 1.753600 |
13400 | 1.762300 |
13500 | 1.784500 |
13600 | 1.694200 |
13700 | 1.818000 |
13800 | 1.842000 |
13900 | 1.760800 |
14000 | 1.659700 |
14100 | 1.771400 |
14200 | 1.815200 |
14300 | 1.768100 |
14400 | 1.842600 |
14500 | 1.721300 |
14600 | 1.661200 |
14700 | 1.785800 |
14800 | 1.730500 |
14900 | 1.760800 |
15000 | 1.780600 |
15100 | 1.717700 |
15200 | 1.813800 |
15300 | 1.772200 |
15400 | 1.757900 |
15500 | 1.745800 |
15600 | 1.794100 |
15700 | 1.773200 |
15800 | 1.765700 |
15900 | 1.883500 |
16000 | 1.764500 |
16100 | 1.723600 |
16200 | 1.867700 |
16300 | 1.794000 |
16400 | 1.856000 |
16500 | 1.683500 |
16600 | 1.770600 |
16700 | 1.721700 |
16800 | 1.711000 |
16900 | 1.747100 |
17000 | 1.770800 |
17100 | 1.750700 |
17200 | 1.761200 |
17300 | 1.820700 |
17400 | 1.799400 |
17500 | 1.780800 |
17600 | 1.638700 |
17700 | 1.689100 |
17800 | 1.760400 |
17900 | 1.851400 |
18000 | 1.695300 |
18100 | 1.766300 |
18200 | 1.729100 |
18300 | 1.886900 |
18400 | 1.843700 |
18500 | 1.703300 |
18600 | 1.809000 |
18700 | 1.602200 |
18800 | 1.785700 |
18900 | 1.713400 |
19000 | 1.665200 |
19100 | 1.715400 |
19200 | 1.769900 |
19300 | 1.674300 |
19400 | 1.780000 |
19500 | 1.811500 |
19600 | 1.674400 |
19700 | 1.681000 |
19800 | 1.736300 |
19900 | 1.749400 |
20000 | 1.796600 |
20100 | 1.727000 |
20200 | 1.696600 |
20300 | 1.640100 |
20400 | 1.835300 |
20500 | 1.739400 |
20600 | 1.680200 |
20700 | 1.888800 |
20800 | 1.785200 |
20900 | 1.718400 |
21000 | 1.787700 |
21100 | 1.812500 |
21200 | 1.648800 |
21300 | 1.792600 |
21400 | 1.713400 |
21500 | 1.705600 |
21600 | 1.787900 |
21700 | 1.756600 |
21800 | 1.846000 |
21900 | 1.912000 |
22000 | 1.754200 |
22100 | 1.781300 |
22200 | 1.727700 |
22300 | 1.696100 |
22400 | 1.809500 |
22500 | 1.629000 |
22600 | 1.842900 |
22700 | 1.711700 |
22800 | 1.844000 |
22900 | 1.817200 |
23000 | 1.707800 |
23100 | 1.683700 |
23200 | 1.781800 |
23300 | 1.737800 |
23400 | 1.655300 |
23500 | 1.721800 |
23600 | 1.811800 |
23700 | 1.741100 |
23800 | 1.690700 |
23900 | 1.812800 |
24000 | 1.629000 |
24100 | 1.733900 |
24200 | 1.788900 |
24300 | 1.658500 |
24400 | 1.647800 |
24500 | 1.826100 |
24600 | 1.708600 |
24700 | 1.775800 |
24800 | 1.773600 |
24900 | 1.739700 |
25000 | 1.694600 |
25100 | 1.647200 |
25200 | 1.753700 |
25300 | 1.758700 |
25400 | 1.754800 |
25500 | 1.738700 |
25600 | 1.747500 |
25700 | 1.682700 |
25800 | 1.653200 |
25900 | 1.756400 |
26000 | 1.753200 |
26100 | 1.643800 |
26200 | 1.718100 |
26300 | 1.761200 |
26400 | 1.773100 |
26500 | 1.778800 |
26600 | 1.718800 |
26700 | 1.812900 |
26800 | 1.783400 |
26900 | 1.794800 |
27000 | 1.697600 |
27100 | 1.650200 |
27200 | 1.738300 |
27300 | 1.757200 |
27400 | 1.791500 |
27500 | 1.668400 |
27600 | 1.829900 |
27700 | 1.685900 |
27800 | 1.741600 |
27900 | 1.650300 |
28000 | 1.767200 |
28100 | 1.753600 |
28200 | 1.758300 |
28300 | 1.640500 |
28400 | 1.747900 |
28500 | 1.761000 |
28600 | 1.673200 |
28700 | 1.708700 |
28800 | 1.738200 |
28900 | 1.726300 |
29000 | 1.664300 |
29100 | 1.701400 |
29200 | 1.689500 |
29300 | 1.750900 |
29400 | 1.757000 |
29500 | 1.828200 |
29600 | 1.694400 |
29700 | 1.718300 |
29800 | 1.729000 |
29900 | 1.760500 |
30000 | 1.705700 |
30100 | 1.810600 |
30200 | 1.747600 |
30300 | 1.668800 |
30400 | 1.776300 |
30500 | 1.737700 |
30600 | 1.764300 |
30700 | 1.724500 |
30800 | 1.804500 |
30900 | 1.721000 |
31000 | 1.793100 |
31100 | 1.673700 |
31200 | 1.788200 |
31300 | 1.796600 |
31400 | 1.739800 |
31500 | 1.787400 |
31600 | 1.780500 |
31700 | 1.821200 |
31800 | 1.733500 |
31900 | 1.712300 |
32000 | 1.801100 |
32100 | 1.698800 |
32200 | 1.758100 |
32300 | 1.769600 |
32400 | 1.673600 |
32500 | 1.656500 |
32600 | 1.667600 |
32700 | 1.703200 |
32800 | 1.737100 |
32900 | 1.703200 |
33000 | 1.628000 |
33100 | 1.673300 |
33200 | 1.707500 |
33300 | 1.717600 |
33400 | 1.823700 |
33500 | 1.754200 |
33600 | 1.788500 |
33700 | 1.770600 |
33800 | 1.828100 |
33900 | 1.734500 |
34000 | 1.659800 |
34100 | 1.745500 |
34200 | 1.738700 |
34300 | 1.769600 |
34400 | 1.780200 |
34500 | 1.788400 |
34600 | 1.641400 |
34700 | 1.765100 |
34800 | 1.705800 |
34900 | 1.763800 |
35000 | 1.646700 |
35100 | 1.747200 |
35200 | 1.655200 |
35300 | 1.687900 |
35400 | 1.625400 |
35500 | 1.669400 |
35600 | 1.739100 |
35700 | 1.779600 |
35800 | 1.683100 |
35900 | 1.641900 |
36000 | 1.653200 |
36100 | 1.809100 |
36200 | 1.752800 |
36300 | 1.694800 |
36400 | 1.679200 |
36500 | 1.679200 |
36600 | 1.723000 |
36700 | 1.688900 |
36800 | 1.755600 |
36900 | 1.692700 |
37000 | 1.693100 |
37100 | 1.654000 |
37200 | 1.752200 |
37300 | 1.665400 |
37400 | 1.694200 |
37500 | 1.742600 |
37600 | 1.700800 |
37700 | 1.706500 |
37800 | 1.642600 |
37900 | 1.720100 |
38000 | 1.789800 |
38100 | 1.725900 |
38200 | 1.718300 |
38300 | 1.732400 |
38400 | 1.687100 |
38500 | 1.728200 |
38600 | 1.750600 |
38700 | 1.820000 |
38800 | 1.746000 |
38900 | 1.666300 |
39000 | 1.654000 |
39100 | 1.707900 |
39200 | 1.641300 |
39300 | 1.800300 |
39400 | 1.730000 |
39500 | 1.786000 |
39600 | 1.734300 |
39700 | 1.716200 |
39800 | 1.787800 |
39900 | 1.703300 |
40000 | 1.803200 |
40100 | 1.674000 |
40200 | 1.656600 |
40300 | 1.692300 |
40400 | 1.777500 |
40500 | 1.665400 |
40600 | 1.701500 |
40700 | 1.680500 |
40800 | 1.755300 |
40900 | 1.704400 |
41000 | 1.725100 |
41100 | 1.725500 |
41200 | 1.716900 |
41300 | 1.809700 |
41400 | 1.630000 |
41500 | 1.715700 |
41600 | 1.744100 |
41700 | 1.600800 |
41800 | 1.685200 |
41900 | 1.765300 |
42000 | 1.718700 |
42100 | 1.729200 |
42200 | 1.817600 |
42300 | 1.684500 |
42400 | 1.740100 |
42500 | 1.695300 |
42600 | 1.714700 |
42700 | 1.756900 |
42800 | 1.774800 |
42900 | 1.645900 |
43000 | 1.830700 |
43100 | 1.692800 |
43200 | 1.726900 |
43300 | 1.781900 |
43400 | 1.696200 |
43500 | 1.780200 |
43600 | 1.729900 |
43700 | 1.698200 |
43800 | 1.674400 |
43900 | 1.734200 |
44000 | 1.734900 |
44100 | 1.652600 |
44200 | 1.684000 |
44300 | 1.717200 |
44400 | 1.602500 |
44500 | 1.801200 |
44600 | 1.708900 |
44700 | 1.664700 |
44800 | 1.726500 |
44900 | 1.806900 |
45000 | 1.762500 |
45100 | 1.756900 |
45200 | 1.731100 |
45300 | 1.699100 |
45400 | 1.754100 |
45500 | 1.691300 |
45600 | 1.710700 |
45700 | 1.690400 |
45800 | 1.715700 |
45900 | 1.730100 |
46000 | 1.638300 |
46100 | 1.632600 |
46200 | 1.653100 |
46300 | 1.578100 |
46400 | 1.604300 |
46500 | 1.719400 |
46600 | 1.716000 |
46700 | 1.735600 |
46800 | 1.631700 |
46900 | 1.769800 |
47000 | 1.689000 |
47100 | 1.683600 |
47200 | 1.772900 |
47300 | 1.608800 |
47400 | 1.615400 |
47500 | 1.740400 |
47600 | 1.714800 |
47700 | 1.730600 |
47800 | 1.731400 |
47900 | 1.786500 |
48000 | 1.671500 |
48100 | 1.685400 |
48200 | 1.750800 |
48300 | 1.721400 |
48400 | 1.735500 |
48500 | 1.681100 |
48600 | 1.664100 |
48700 | 1.766900 |
48800 | 1.805900 |
48900 | 1.705700 |
49000 | 1.681200 |
49100 | 1.816800 |
49200 | 1.742000 |
49300 | 1.686000 |
49400 | 1.668000 |
49500 | 1.544200 |
49600 | 1.666500 |
49700 | 1.735100 |
49800 | 1.780000 |
49900 | 1.636000 |
50000 | 1.693800 |
50100 | 1.725100 |
50200 | 1.696400 |
50300 | 1.646900 |
50400 | 1.672400 |
50500 | 1.760300 |
50600 | 1.728100 |
50700 | 1.638700 |
50800 | 1.645800 |
50900 | 1.727200 |
51000 | 1.720900 |
51100 | 1.674000 |
51200 | 1.752300 |
51300 | 1.665300 |
51400 | 1.740100 |
51500 | 1.686300 |
51600 | 1.683600 |
51700 | 1.764200 |
51800 | 1.674100 |
51900 | 1.877300 |
52000 | 1.726600 |
52100 | 1.682200 |
52200 | 1.763100 |
52300 | 1.791400 |
52400 | 1.804600 |
52500 | 1.730900 |
52600 | 1.694300 |
52700 | 1.754900 |
52800 | 1.768100 |
52900 | 1.705500 |
53000 | 1.689200 |
53100 | 1.750800 |
53200 | 1.586200 |
53300 | 1.704500 |
53400 | 1.727200 |
53500 | 1.767500 |
53600 | 1.712500 |
53700 | 1.766200 |
53800 | 1.621100 |
53900 | 1.773000 |
54000 | 1.644500 |
54100 | 1.692900 |
54200 | 1.751100 |
54300 | 1.675100 |
54400 | 1.727000 |
54500 | 1.748400 |
54600 | 1.683500 |
54700 | 1.781300 |
54800 | 1.643800 |
54900 | 1.777300 |
55000 | 1.679300 |
55100 | 1.752200 |
55200 | 1.755600 |
55300 | 1.584800 |
55400 | 1.763800 |
55500 | 1.738800 |
55600 | 1.682600 |
TrainOutput(global_step=55636, training_loss=1.7573995420853747, metrics={'train_runtime': 17617.2965, 'train_samples_per_second': 6.316, 'train_steps_per_second': 3.158, 'total_flos': 9.971404632332698e+16, 'train_loss': 1.7573995420853747, 'epoch': 1.0})
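The numbers in the TrainOutput line can be cross-checked against the settings used earlier. The sketch below recomputes the step count, warmup length, and throughput from values printed in this notebook, and compares the mean of the first five logged losses with the mean of the last five. It assumes a single training device, which the notebook does not state explicitly.

```python
import math

num_rows = 111272             # from the dataset printout
per_device_batch_size = 2     # from TrainingArguments
grad_accum_steps = 1          # from TrainingArguments
warmup_ratio = 0.1            # from TrainingArguments
train_runtime_s = 17617.2965  # from TrainOutput

# One optimizer step consumes batch_size * accumulation examples (single device assumed)
steps_per_epoch = math.ceil(num_rows / (per_device_batch_size * grad_accum_steps))
warmup_steps = math.ceil(steps_per_epoch * warmup_ratio)
samples_per_second = num_rows / train_runtime_s
runtime_hours = train_runtime_s / 3600

# Logged losses copied from the table: steps 100-500 and steps 55200-55600
first_five = [2.1964, 2.0652, 1.9939, 1.9663, 2.0269]
last_five = [1.7556, 1.5848, 1.7638, 1.7388, 1.6826]
mean_first = sum(first_five) / len(first_five)
mean_last = sum(last_five) / len(last_five)

print("steps per epoch:", steps_per_epoch)
print("warmup steps:", warmup_steps)
print("samples per second:", round(samples_per_second, 3))
print("runtime in hours:", round(runtime_hours, 2))
print("mean loss, first five logs:", round(mean_first, 3))
print("mean loss, last five logs:", round(mean_last, 3))
```

The step count matches global_step=55636, and the drop from roughly 2.05 to roughly 1.71 summarizes the noisy table in two numbers.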
This code cell saves the trained model to a directory named "outputs." It also checks for distributed or parallel training and handles the saving process accordingly.
# Unwrap the model if it was wrapped for distributed/parallel training
model_to_save = trainer.model.module if hasattr(trainer.model, 'module') else trainer.model
model_to_save.save_pretrained("outputs")
In this code cell, the LoRA configuration saved by PEFT (Parameter-Efficient Fine-Tuning) is loaded from the "outputs" directory using the LoraConfig class, and get_peft_model wraps the model with adapters built from that configuration. Note that get_peft_model only applies the configuration; loading saved adapter weights from a directory is normally done with PeftModel.from_pretrained.
from peft import get_peft_model
lora_config = LoraConfig.from_pretrained('outputs')
model = get_peft_model(model, lora_config)
In this code cell, the provided prompts are used to generate responses from the model. The model is moved to the GPU device specified by device. A response is generated for each prompt with max_length set to max_token_limit; note that max_length counts the prompt tokens as well as the generated ones. The generated responses are then printed for each prompt.
device = "cuda:0"
# Move the model to the GPU
model.to(device)
prompts = [
    "Add 5 and 7.",
    "Multiply 3 by 9.",
    "Write code to print 'Hello, world!' in Python.",
    "Calculate the square root of 16.",
    "Find the result of 12 divided by 4.",
    "Write a Python program to check if a number is even or odd.",
]
# Initialize an empty list to store the model's responses
responses = []
# Maximum token limit for responses
max_token_limit = 100  # Adjust this limit as needed
# Loop through the prompts and generate responses
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_length=max_token_limit, num_return_sequences=1, no_repeat_ngram_size=2)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    responses.append(response)
# Print the responses
for i, response in enumerate(responses):
    print(f"PROMPT {i + 1}:\n{prompts[i]}\nRESPONSE:\n{response}\n")
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
PROMPT 1: Add 5 and 7.
RESPONSE: Add 5 and 7. # print(f"The sum of the numbers is {sum(numbers)}") n = int(input("Enter the number of elements: ")) for i in range(0, n): print("Element: ", end="") element = input()

PROMPT 2: Multiply 3 by 9.
RESPONSE: Multiply 3 by 9. # def multiply(a, b): def multiply_3(x, y): return x * y print(multipy(3, 9))

PROMPT 3: Write code to print 'Hello, world!' in Python.
RESPONSE: Write code to print 'Hello, world!' in Python. # print('Hello', 'world!') """ Output: Hello world! """

PROMPT 4: Calculate the square root of 16.
RESPONSE: Calculate the square root of 16. # In[ ]: import math print(math.sqrt(16))

PROMPT 5: Find the result of 12 divided by 4.
RESPONSE: Find the result of 12 divided by 4. # In[ ]: def divide(x, y): return x / y print(divide(12, 4))

PROMPT 6: Write a Python program to check if a number is even or odd.
RESPONSE: Write a Python program to check if a number is even or odd. # def is_even(num): # if num % 2 == 0: # return True # else: #return False def isOdd(number): if number %2 ==0: return True else: return False
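The responses above echo the prompt, and the prompts were sent as bare sentences rather than in the "### Question: ... ### Answer:" layout used during training. One thing that may be worth trying is to wrap each instruction in the training template and strip the echoed prompt from the decoded text. The sketch below shows only the string handling; build_prompt and extract_answer are hypothetical helpers, and the generated text is a stand-in rather than real model output.

```python
def build_prompt(instruction):
    """Wrap an instruction in the same template the training examples used."""
    return f"### Question: {instruction}\n ### Answer:"


def extract_answer(generated_text, prompt):
    """Drop the echoed prompt, if present, and return only the continuation."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


prompt = build_prompt("Add 5 and 7.")
# Stand-in for tokenizer.decode(...); a real model's output will differ
fake_generation = prompt + " print(5 + 7)"
answer = extract_answer(fake_generation, prompt)

print(repr(prompt))
print(answer)
```

In the generation loop, the string returned by build_prompt would replace the raw prompt passed to the tokenizer, and extract_answer would be applied to the decoded response.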
Conclusion:
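Overall, the model performed decently on simple prompts: the square-root and division responses contain correct code, and the 'Hello, world!' response is close (it prints the greeting without the comma).
The other responses mix relevant and unrelated code: the addition response reads a list of inputs instead of adding 5 and 7, the multiplication response calls a misspelled function name (multipy), and the even/odd response returns True from isOdd when the number is even.
Several responses also include material beyond the desired answer, such as commented-out code and notebook cell markers (# In[ ]:).
As next steps, we may want to train for more epochs to get better results.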
.................................................................................................
Link to the data used: "https://github.com/DataTalksClub/machine-learning-zoomcamp"
Install Required Packages
!pip install -q -U git+https://github.com/aiplanethub/genai-stack.git
!pip install -q -U langchain
Installing build dependencies ... done Getting requirements to build wheel ... done Preparing metadata (pyproject.toml) ... done
Set Up OpenAI API Key
To use the OpenAI API for this project, make sure to set up your API key. You can do this by running the following code snippet, which securely prompts you for your API key and stores it in the OPENAI_API_KEY environment variable.
import os
from getpass import getpass
api_key = getpass("Enter OpenAI API Key:")
os.environ['OPENAI_API_KEY'] = api_key
Enter OpenAI API Key:··········
GenAI Stack Modules
The code below imports various modules from the GenAI Stack, a comprehensive library for AI and machine learning applications. Each module plays a specific role in different stages of an AI project, from data preprocessing and embedding to memory management and retrieval. These modules enable the development and deployment of sophisticated AI models efficiently.
from genai_stack.stack.stack import Stack
from genai_stack.etl.langchain import LangchainETL
from genai_stack.embedding.langchain import LangchainEmbedding
from genai_stack.vectordb.chromadb import ChromaDB
from genai_stack.prompt_engine.engine import PromptEngine
from genai_stack.model.gpt3_5 import OpenAIGpt35Model
from genai_stack.retriever.langchain import LangChainRetriever
from genai_stack.memory.langchain import ConversationBufferMemory
Data Extraction and Transformation (ETL)
The code initiates an ETL (Extract, Transform, Load) process using the LangchainETL module from the GenAI Stack. It creates an ETL instance named "WebBaseLoader" to extract data from a list of websites specified in the "websites" variable. This step is fundamental for preparing data from web sources for subsequent AI and machine learning tasks.
# Create a list of websites for ETL
websites = [
    "https://github.com/DataTalksClub/machine-learning-zoomcamp"
]
etl = LangchainETL.from_kwargs(
    name="WebBaseLoader",
    fields={"web_path": websites},
)
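Retrieval pipelines commonly split loaded page text into overlapping chunks before embedding it. Whether and how LangchainETL does this internally is not shown in this notebook, so the following is only a minimal sketch of the general idea; chunk_text, its parameters, and the sample string are hypothetical.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


page_text = "abcdefghij"  # stand-in for text extracted from a web page
chunks = chunk_text(page_text, chunk_size=4, overlap=1)
print(chunks)
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.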
Embedding Configuration
The code defines an embedding configuration using the "config" dictionary. It specifies the "model_name" for embedding, which in this case is set to "sentence-transformers/all-mpnet-base-v2." Additionally, it includes model-related arguments such as "device" set to "cpu," and "encode_kwargs" with "normalize_embeddings" set to "False." This configuration is used to create an embedding instance named "HuggingFaceEmbeddings" using the LangchainEmbedding module. Embeddings are crucial for representing and processing text data in various NLP tasks.
config = {
    "model_name": "sentence-transformers/all-mpnet-base-v2",
    "model_kwargs": {"device": "cpu"},
    "encode_kwargs": {"normalize_embeddings": False},
}
embedding = LangchainEmbedding.from_kwargs(name="HuggingFaceEmbeddings", fields=config)
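The normalize_embeddings option controls whether each vector is scaled to unit length. As a rough illustration of why that can matter, the sketch below compares a raw dot product with cosine similarity for two made-up vectors that point in the same direction but differ in length. The vectors and helper names are assumptions for illustration and have nothing to do with the actual model output.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product of the two vectors after normalizing each to unit length."""
    return dot(l2_normalize(a), l2_normalize(b))


short = [1.0, 2.0]
long = [10.0, 20.0]  # same direction as `short`, ten times the length

raw_score = dot(short, long)
cos_score = cosine_similarity(short, long)
print("raw dot product:", raw_score)
print("cosine similarity:", cos_score)
```

With unnormalized vectors a dot-product score grows with vector length, whereas cosine similarity depends only on direction; which distance measure the vector store applies determines whether the setting changes the ranking.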
ChromaDB Initialization
The code snippet initializes the ChromaDB, a database instance for storing and managing vectors and data. It utilizes the "ChromaDB.from_kwargs()" method to create the database. ChromaDB is commonly used in machine learning and data retrieval tasks for efficient storage and retrieval of vectors, embeddings, and other data relevant to the model's operations.
chromadb = ChromaDB.from_kwargs()
OpenAI GPT-3.5 Model Initialization
The code initializes an instance of the OpenAI GPT-3.5 model. It utilizes the "OpenAIGpt35Model.from_kwargs()" method with parameters including the OpenAI API key, ensuring that the model can access the necessary resources and APIs for language understanding and generation tasks. GPT-3.5 is a powerful language model developed by OpenAI, capable of various natural language processing tasks.
llm = OpenAIGpt35Model.from_kwargs(parameters={"openai_api_key": api_key})
Creating an AI Stack
This code block is responsible for creating a comprehensive AI stack. It assembles various components required for natural language understanding, processing, and generation tasks. The stack comprises the following elements:
- etl: Data extraction, transformation, and loading (ETL) components.
- embedding: Language embeddings for understanding and encoding text.
- vectordb: A vector database (ChromaDB) for efficient storage and retrieval of vectorized data.
- model: The language model (GPT-3.5) by OpenAI, used for generating human-like text.
- prompt_engine: An engine that assists in formulating appropriate prompts for the model.
- retriever: A component for retrieving specific information or data.
- memory: A memory buffer for storing and managing conversations or interactions.
Together, these components form an AI stack suitable for various language-related tasks, from data processing to conversational AI.
retriever = LangChainRetriever.from_kwargs()
Stack(
    etl=etl,
    embedding=embedding,
    vectordb=chromadb,
    model=llm,
    prompt_engine=PromptEngine.from_kwargs(should_validate=False),
    retriever=retriever,
    memory=ConversationBufferMemory.from_kwargs(),
)
<genai_stack.stack.stack.Stack at 0x79649658f220>
etl.run()
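To show how the components of such a stack fit together conceptually, here is a toy retrieve-then-prompt pipeline in plain Python. It stands in word counts for real embeddings and a list for the vector database, and it stops at building the prompt that would be sent to a language model. Everything in it (function names, sample documents, prompt layout) is hypothetical and is not GenAI Stack's implementation.

```python
import math
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts instead of a learned vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    numerator = sum(a[word] * b[word] for word in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    denominator = norm_a * norm_b
    return numerator / denominator if denominator else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vector, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, context_chunks):
    """Place the retrieved chunks ahead of the question for the language model."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Made-up documents standing in for chunks extracted by the ETL step
documents = [
    "the course covers regression and classification",
    "deployment uses docker and kubernetes",
    "participants submit homework and projects",
]
query = "which tools does deployment use"
top = retrieve(query, documents, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

In the real stack, the embedding model, ChromaDB, and the retriever take the place of embed, the document list, and retrieve, and the assembled prompt is sent to GPT-3.5.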
Evaluation
The provided code segment performs the following actions:
- It defines a list of prompts, which are questions or queries related to "ML Zoomcamp."
- It iterates through each prompt and retrieves answers to these prompts using the retriever object.
- For each prompt, it prints the prompt itself and the retrieved answer to the console, separating them with "PROMPT" and "ANSWER" labels.
prompts = [
    "Could you provide an overview of ML Zoomcamp? What is its primary focus, and what can participants expect to learn from the program?",
    "What are the key concepts and topics covered in ML Zoomcamp's curriculum?",
    "To obtain a certificate from ML Zoomcamp, what are the specific requirements that participants need to fulfill?"
]
for prompt in prompts:
    response = retriever.retrieve(prompt)
    print("PROMPT:", prompt)
    print("ANSWER:", response['output'])
    print("\n")
PROMPT: Could you provide an overview of ML Zoomcamp? What is its primary focus, and what can participants expect to learn from the program?
ANSWER: ML Zoomcamp is a free online program offered by DataTalksClub that focuses on teaching participants about machine learning engineering. The program is designed to be completed in four months and covers various topics related to machine learning. Participants can expect to learn about the fundamentals of machine learning, regression and classification techniques, evaluation metrics, deploying machine learning models, decision trees and ensemble learning, neural networks and deep learning, serverless deep learning, and Kubernetes and TensorFlow serving. The program also includes hands-on projects and homework assignments to reinforce the learning. By the end of the program, participants will have gained practical skills in machine learning engineering and be able to apply their knowledge to real-world projects.

PROMPT: What are the key concepts and topics covered in ML Zoomcamp's curriculum?
ANSWER: The key concepts and topics covered in ML Zoomcamp's curriculum include:
1. Introduction to Machine Learning: This module provides an overview of machine learning, including supervised learning, the CRISP-DM process, model selection, and setting up the environment.
2. Machine Learning for Regression: Participants learn about regression techniques, data preparation, exploratory data analysis, linear regression, feature engineering, regularization, and model tuning.
3. Machine Learning for Classification: This module focuses on classification techniques, data preparation, feature importance, logistic regression, model interpretation, and using the model.
4. Evaluation Metrics for Classification: Participants learn about evaluation metrics such as accuracy, precision, recall, ROC curves, and cross-validation.
5. Deploying Machine Learning Models: This module covers saving and loading models, web services using Flask, Python virtual environments, Docker, and deployment to the cloud using AWS Elastic Beanstalk.
6. Decision Trees and Ensemble Learning: Participants learn about decision trees, ensemble learning, random forests, gradient boosting, and XGBoost.
7. Neural Networks and Deep Learning: This module introduces neural networks, pre-trained models, convolutional neural networks, transfer learning, regularization, and data augmentation.
8. Serverless Deep Learning: Participants learn about serverless computing, AWS Lambda, TensorFlow Lite, creating Docker images, and deploying lambda functions.
9. Kubernetes and TensorFlow Serving: This module covers TensorFlow Serving, creating pre-processing services, running locally with Docker-compose, deploying to Kubernetes, and deploying to EKS.
10. KServe (optional): Participants have the option to learn about KServe, running it locally, deploying Scikit-Learn and TensorFlow models, and using KServe transformers.
11. Capstone Projects: Participants work on two capstone projects to apply their knowledge and skills to real-world scenarios.
Overall, the curriculum covers a wide range of topics in machine learning engineering, providing participants with a comprehensive understanding of the field.

PROMPT: To obtain a certificate from ML Zoomcamp, what are the specific requirements that participants need to fulfill?
ANSWER: To obtain a certificate from ML Zoomcamp, participants need to fulfill the following requirements:
1. Attend all the live sessions: Participants are required to attend all the live sessions conducted during the program. These sessions provide additional insights, explanations, and opportunities for interaction with instructors and fellow participants.
2. Complete all the homework assignments: Participants need to complete all the homework assignments provided throughout the program. These assignments are designed to reinforce the concepts learned and provide hands-on experience with machine learning techniques.
3. Complete the capstone projects: Participants are required to complete two capstone projects, which involve applying the knowledge and skills gained from the program to real-world scenarios. These projects demonstrate the participant's ability to solve practical machine learning problems.
4. Pass the final exam: At the end of the program, participants need to pass a final exam that tests their understanding of the key concepts and topics covered in ML Zoomcamp.
By fulfilling these requirements, participants will demonstrate their proficiency in machine learning engineering and will be eligible to receive a certificate from ML Zoomcamp.
Conclusion
The project showcased the potential of the GenAI Stack for efficient data extraction and analysis. By integrating various modules and AI components, it provided a seamless workflow for content retrieval and generation based on natural language queries. The GenAI Stack's versatility and adaptability make it a valuable tool for AI-driven data processing and content generation.
In the context of the ML Zoomcamp curriculum, the project effectively retrieved concise and informative answers to a set of specific questions. The model demonstrated an understanding of the program's overview, key concepts and topics covered, and the requirements for obtaining a certificate from ML Zoomcamp. This highlights the potential of using NLP models for educational and informational purposes, enabling users to quickly access relevant information.
............................................................................
Follow me on Twitter 🐦, connect with me on LinkedIn 🔗, and check out my GitHub 🐙. You won't be disappointed!
👉 Twitter: https://twitter.com/NdiranguMuturi1
👉 LinkedIn: https://www.linkedin.com/in/isaac-muturi-3b6b2b237
👉 GitHub: https://github.com/Isaac-Ndirangu-Muturi-749